cxo: fix Cache.cleanDown eviction bug and skip re-encoding unchanged sub-trees#2420

Merged
0pcom merged 2 commits into skycoin:develop from 0pcom:fix/cxo-cache-clean-down-and-publisher
May 3, 2026
Conversation

0pcom (Collaborator) commented May 3, 2026

Summary

Two CXO fixes that together drop a public visor's GC CPU share from ~60% to baseline and substantially reduce live heap. Found via pprof on a public visor running the user-feeds + visor-stats publishers; both fixes apply to every CXO node in the network, not just the visor.

1. cxo/skyobject: Cache.cleanDown never actually evicted anything

pkg/cxo/skyobject/cache.go:341,345 — the filter loop's "skip wanted" / "skip filling" comments said skip, but the code returned, aborting the entire ranking on the first non-evictable entry encountered during map iteration. On a busy node there is virtually always something filling or being awaited, so the cache never evicted: putItem invoked cleanDown once the budget was exceeded, cleanDown built a len(c.is)-sized rank slice and immediately discarded it, and putItem then inserted the new item on top of an already-over-budget cache.

The discarded slice was 95.6% of total bytes allocated under cleanDown in the alloc profile; the un-evicted backlog was the dominant share of live heap under putItem. Fix: continue, not return. Also switched the rank slice to []rankItem (value-slice) to drop a per-entry heap allocation.
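The shape of the fix can be sketched as follows. This is an illustrative stand-in, not the actual skyobject cache code: the `item`/`rankItem` fields and `rankEvictable` are hypothetical names chosen for the sketch, and the value-slice `[]rankItem` mirrors the allocation change described above.

```go
package main

import "fmt"

// Hypothetical stand-ins for the cache's internal entry state.
type item struct {
	filling bool // still being downloaded
	wanted  bool // awaited by a subscriber
	points  int  // eviction rank: lower evicts first
}

type rankItem struct {
	key    string
	points int
}

// rankEvictable mirrors the fixed filter loop: non-evictable entries
// are skipped with continue. The original bug used return here, which
// aborted the entire ranking on the first filling/wanted entry hit
// during map iteration, so nothing was ever evicted.
func rankEvictable(items map[string]*item) []rankItem {
	rank := make([]rankItem, 0, len(items)) // value slice: no per-entry heap alloc
	for k, it := range items {
		if it.wanted {
			continue // skip wanted -- was: return (the bug)
		}
		if it.filling {
			continue // skip filling -- was: return (the bug)
		}
		rank = append(rank, rankItem{key: k, points: it.points})
	}
	return rank
}

func main() {
	items := map[string]*item{
		"a": {filling: true},
		"b": {points: 2},
		"c": {wanted: true},
		"d": {points: 1},
	}
	// "b" and "d" are evictable; "a" and "c" are skipped, not fatal.
	fmt.Println(len(rankEvictable(items)))
}
```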

This is a pure bug fix and benefits every skyobject.Container — every CXO publisher, subscriber, the TPD aggregator, and the standalone cxo daemon.

2. cxo/treestore: cache encoded sub-node hashes across publishes

Publisher.publishRoot re-walked the entire in-memory tree on every dirty batch and ran encoder.Serialize + entry.Sub.SetValue at every level, even when only a single leaf had changed. The package doc already promised content-addressed re-use of unchanged sub-trees ("only the O(depth) chain of objects from the modified leaf up to the root gets re-encoded"); the implementation didn't match.

Adds a per-memNode encode-cache (pubHash + cached) that's populated after each successful Save and invalidated by putAt/deleteAt/pruneAt along the touched path. encodeNode now places the cached hash directly into the parent's TreeEntry for any unchanged sub-node, skipping the recursive walk and the encoder.Serialize/Cache.Set chain entirely. Container.Save's reference walk increments the rc on the cached hash without descending into it (skyobject/unpack.go:156-167), so the underlying CXO objects stay alive across publishes as long as each successive Root references the same hash.

Cache promotion is deferred until after Save returns nil so an aborted publish doesn't leave the cache pointing at hashes that up.Close has rolled back.
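The cache-and-invalidate mechanism can be sketched like this. The types and names (`memNode`, `pubHash`, `cached`, `invalidatePath`) are assumptions for illustration, not the actual treestore implementation, and the hash is a toy string rather than a real CXO hash; promotion is done inline here for brevity, whereas the real publisher defers it until Save returns nil.

```go
package main

import (
	"fmt"
	"sort"
)

// Illustrative sketch of the per-node encode cache.
type memNode struct {
	children map[string]*memNode
	value    []byte
	pubHash  string // hash placed into the parent's entry on last publish
	cached   bool   // true while pubHash is valid for the current contents
}

var encodes int // counts how many nodes were actually re-encoded

// encodeNode reuses the cached hash for any unchanged sub-tree,
// skipping the recursive walk and re-serialization entirely.
func encodeNode(n *memNode) string {
	if n.cached {
		return n.pubHash // fast path: unchanged sub-tree
	}
	encodes++
	h := fmt.Sprintf("h(%q", n.value) // toy stand-in for encoder.Serialize
	keys := make([]string, 0, len(n.children))
	for k := range n.children {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		h += "," + k + "=" + encodeNode(n.children[k])
	}
	h += ")"
	n.pubHash, n.cached = h, true // real code defers this until after Save
	return h
}

// invalidatePath clears cached along the touched path only,
// leaving sibling sub-trees' cached hashes intact.
func invalidatePath(n *memNode, path ...string) {
	n.cached = false
	if len(path) == 0 {
		return
	}
	if c, ok := n.children[path[0]]; ok {
		invalidatePath(c, path[1:]...)
	}
}

func main() {
	root := &memNode{children: map[string]*memNode{
		"left":  {value: []byte("L")},
		"right": {value: []byte("R")},
	}}
	encodeNode(root) // first publish encodes all 3 nodes
	first := encodes

	invalidatePath(root, "left") // a Put under "left"
	encodeNode(root)             // only root and "left" re-encode
	fmt.Println(first, encodes-first)
}
```

This is the O(depth) behavior the package doc promises: a single leaf change re-encodes only the chain from that leaf to the root.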

Benefits every treestore.Publisher: pkg/visor/init_stats.go, pkg/visor/cxo_user_feeds.go, cmd/apps/skychat/pairing/pair.go.

Test plan

  • go test ./pkg/cxo/... (treestore, skyobject, node, data, idxdb, cxds — all green)
  • go build ./...
  • New TestEncodeCacheInvalidatesOnlyAffectedPath pins the sibling-preservation contract: a Put under one branch must not change pubHashes on unrelated subtrees, and cached must clear only on the touched path.
  • Existing TestPublisherUnchangedSubtreeKeepsHash still passes — the doc-comment guarantee is now actively enforced by the cache fast path rather than only emergent from CXO content-addressing.
  • Verify on a live public visor: re-run pprof after deploy and confirm GC CPU share drops and Cache.putItem live heap shrinks.

0pcom added 2 commits May 3, 2026 13:05
…lling item

The filter loop's "skip wanted" / "skip filling" comments said skip,
but the code returned, aborting the entire ranking on the first
non-evictable entry encountered during map iteration. On a busy
visor there is virtually always something filling or being awaited,
so the cache never evicted anything: putItem invoked cleanDown once
the budget was exceeded, cleanDown built a len(c.is)-sized rank slice
and immediately discarded it, and putItem then inserted the new item
on top of an already-over-budget cache. The discarded slice is what
the alloc profile pinned at 95% of total bytes allocated under
cleanDown, and the un-evicted backlog is what pinned the live heap
under putItem.

continue, not return. While here, switch the rank slice from
[]*rankItem to []rankItem so we don't allocate a tiny heap object
per cache entry.
Publisher.publishRoot re-walked the entire in-memory tree on every
dirty batch and called encoder.Serialize + entry.Sub.SetValue at
every level, even when only a single leaf had changed. With the
visor stats publisher running at 1-min cadence this dominated CPU
(via GC pressure from the Serialize churn) and live heap (the
encoded TreeNode bytes piling up in Cache.putItem).

Add a per-memNode encode-cache (pubHash + cached) that is populated
after each successful Save and invalidated by putAt / deleteAt /
pruneAt along the touched path. encodeNode now places the cached
hash directly into the parent's TreeEntry for any unchanged
sub-node and skips the recursive walk plus the
encoder.Serialize / Cache.Set chain entirely. Container.Save's
reference walk increments the rc on the cached hash without
descending into it (skyobject/unpack.go:156-167), so the underlying
CXO objects stay alive across publishes as long as each successive
Root references the same hash — exactly the content-addressed
behavior the package doc already promised.

Cache promotion is deferred until after Save returns nil so an
aborted publish doesn't leave the cache pointing at hashes that
up.Close has rolled back.

Adds TestEncodeCacheInvalidatesOnlyAffectedPath to pin the
sibling-preservation contract.
0pcom merged commit f0a45db into skycoin:develop May 3, 2026
3 of 4 checks passed
0pcom added a commit that referenced this pull request May 4, 2026
…y from stats bbolt (#2427)

The per-transport CSV log store wrote a row to a daily file on every
packet via gocsv reflection-based marshalling. On a public visor that
turned into the dominant alloc-rate symbol — pprof showed
transport.fileTransportLogStore.writeToCSV at ~78% cumulative bytes
allocated, dwarfing every other hot path once the CXO publisher's
own allocation churn was fixed (#2420). The CSV files were also
narrower than the bbolt-backed stats store that already exists
(no latency, less precise timestamps), and the stats store was
already plumbed into the CXO publisher feeding TPD's aggregator.

Drop the file-backed log stores entirely:

- pkg/transport/log.go — remove fileTransportLogStore and the
  parallel fileLatencyLogStore (the latter was wired into
  ManagerConfig but never read by anything — pure dead weight).
  CsvEntry / LatencyCsvEntry gocsv structs go with them. The
  in-memory LogStore stays — ManagedTransport uses it to preserve
  cumulative byte counters across same-session reconnects of the
  same transport ID.

- pkg/transport/manager.go — drop the unused LatencyLogStore
  field on ManagerConfig.

- pkg/visor/init_transport.go — collapse the file/memory branch to
  always-memory; drop the latency-store init. LogStore.Type=file
  in existing visor configs is accepted but logged as deprecated.

- pkg/visor/api_transport.go — re-implement GetTransportLogs to
  walk stats.Store.AllTransportRecords() and emit one entry per
  (transport, day) within the requested window. The CSV path's
  RecvBytes/SentBytes was the within-day delta (CSVs were
  zero-anchored at each day rollover), which matches DailyRollup
  semantics exactly — `skywire-cli tp -b N` aggregation is
  unchanged. Drop the local csvEntry parser and readTransportLogFile.

- pkg/visor/logserver/api.go — drop the /transport_logs/:file
  HTTP route and its landing-page listing. Same data is served by
  /stats/transports/history (live snapshot at /stats/transports).
  The unused tpLogPath parameter on logserver.New stays in the
  signature for call-site compat (init_dmsg.go passes it).

- pkg/skyenv/skyenv.go — drop the unused LatencyLogStore directory
  constant; leave TpLogStore in place since gen.go still emits it
  as the LogStore.Location default for backward-compatible configs
  (the runtime ignores the path — nothing is written there).

Cumulative bytes counters on per-transport status (skywire-cli tp ls,
hypervisor UI) still reflect "bytes since transport became live in
this visor session" — same in-memory semantics that file-mode users
also got within a single session. Across visor restarts everyone
starts fresh, which is what memory-mode users were already
experiencing.

Tests: package tests pass (transport, visor, visor/logserver,
visor/stats, visor/visorconfig). The only remaining file-mode test
(TestFileTransportLogStore) is removed along with the
implementation.
